Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change stream list to arrays and optimize stream lookup with SSE2 #508

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Lastique
Copy link
Contributor

@Lastique Lastique commented Sep 25, 2020

The stream list in the SRTP context is now implemented with two arrays: an array of SSRCs and an array of pointers to the streams corresponding to the SSRCs. The streams no longer form a single linked list.

Stream lookup by SSRC is now performed over the array of SSRCs, which is considerably faster because it is more cache-friendly. Additionally, the lookup is optimized for SSE2, which provides an additional massive speedup with many streams in the list. Although the lookup still has linear complexity, its absolute times are reduced and with tens to hundreds elements are lower or comparable with a typical rb-tree equivalent.

Expected speedup of SSE2 version over the previous implementation:

SSRCs    speedup (scalar)   speedup (SSE2)

1        0.39x              0.22x
3        0.57x              0.23x
5        0.69x              0.62x
10       0.77x              1.43x
20       0.86x              2.38x
30       0.87x              3.44x
50       1.13x              6.21x
100      1.25x              8.51x
200      1.30x              9.83x

At small numbers of SSRCs the new algorithm is somewhat slower, but given that the absolute and relative times of the lookup are very small, that slowdown is not very significant.

The performance benchmark is written in C++ and is attached below, along with test results on a Core i7 2600K (running Ubuntu 20.04, gcc 9.3). The test emulates the current implementation by using std::list, and estimates a potential RB-tree implementation performance by using std::set.

This PR is somewhat related to #507. It relies on the change to config_in_cmake.h made in that PR so that the SSE2 code is enabled for MSVC. Without it MSVC will use scalar code. In all other respects this PR is self-contained (i.e. non-MSVC compilers will use SSE2 intrinsics code).

This PR is related to #417. Although it doesn't change the formal O(N) complexity, it does reduce lookup times significantly and shows comparable or better performance than std::set (see the attached test results).

ssrc_map_test.tar.gz

@Lastique Lastique force-pushed the optimize_stream_list branch from a4c8d6c to f911812 Compare September 25, 2020 17:28
@pabuhler
Copy link
Member

pabuhler commented Oct 8, 2020

Hi, improving the performance of this list and associated lookup has been our list of things to do for a while.
May I ask what is the typical size of the list that you use, ie what where optimizing for? In my most common use case the size of the list is usually less than 10 with a max of around 30.

@Lastique
Copy link
Contributor Author

Lastique commented Oct 8, 2020

10-15 is a good estimate for our current common case, where we use 1 audio section (1 local and 1 remote SSRC) and up to 2 video sections with 2 or 3 simulcast substreams (1 media and 1 RTX SSRC per substream per side) in a bundle. We are planning to create more media sections, ideally bound by the number of participants in a conference, which is where the number of SSRCs may grow N-fold. We also plan to add support for FlexFEC, which will add +1 SSRC per video substream per side.

@murillo128
Copy link

I am doing media server cascading, and I can have easily more than 100 ssrcs in the same bundle. This would be a great boost for my use case.

@Lastique Lastique force-pushed the optimize_stream_list branch from f911812 to c11dcf3 Compare November 17, 2021 14:43
@Lastique
Copy link
Contributor Author

Rebased on top of the current master.

@nils-ohlmeier
Copy link
Contributor

Awesome contribution! Definitely super helpful contribution for conference servers.

I'm wondering what people think if this change should be a compile time option only. Because I assume that client side folks, e.g. the browsers and others, are not keen on a performance hit, even if it's not a big one. So the idea would be that at compile time you could choose between optimizations for low amount of RTP streams vs high amount (which hopefully should translate to client side vs server side).

@murillo128
Copy link

how about splitting the PR in two:

  • one for extracting the current ssrc lookup code into a different .c file with clean function definitions called by strp.c (should not be controversial to merge)
  • another PR for adding the SSE2 lookup functionality in a different. c file and a config flag to link against one or another

Even if the second one is not merged, server devs could easily provide their own implementation of the lookup functions

@Lastique
Copy link
Contributor Author

Lastique commented Jan 2, 2022

I don't think a compile-time option is worth the complication considering that the lookup among a few SSRCs is super fast however you implement it. Browsers have other, much more fruitful areas of optimizations.

Consider also that libraries such as libsrtp are usually built in one variant for system packaging, and that library may be used by any kind of application, server or client. So I would prefer the code base to remain unified and optimal.

However, if there are code reorganising or other requests from libsrtp maintainers that will help getting this merged, please let me know.

@eakraly
Copy link

eakraly commented Jan 19, 2022

Does not look like this question was asked: what is the change (speedup, if any) for non-SSE2 implementation?
x86-64 is not the only platform out there - I run most of the fleet on arm64 (and thus SSE2 does not apply)

I am not sure if @murillo128 intent was that but being able to add additional optimized implementations does make sense

In any case - looking forward to this change merging!

@Lastique
Copy link
Contributor Author

@eakraly The original description shows scalar performance on x86-64 platform. I don't have an ARM test system, but there is also a benchmark attached that you can run on your system.

As for optimizations for ARM/aarch64, those would be nice. I suppose, it should be doable, but you will probably have a problem with implementing PMOVMSKB in NEON or SVE, as there isn't a direct equivalent instruction. In any case, I don't have enough experience with NEON and can't test it anyway, so someone else will have to implement it in a separate PR.

@murillo128
Copy link

that's why I think splitting the PR in two, one for creating an interface for ssrc lookup and another for adding the sse2 logic would be helpful.

@Lastique
Copy link
Contributor Author

that's why I think splitting the PR in two, one for creating an interface for ssrc lookup and another for adding the sse2 logic would be helpful.

I can't separate the interface change from the implementation, as the interface implies the array-based implementation (i.e. indexes are used everywhere instead of pointers). IOW, I can't wrap the old implementation in the new interface. The SSE2 part of the PR is actually minor, most of it is plain C implementing the new container and interacting with it.

And the maintainers haven't expressed opinion if this approach is acceptable at all, SSE2 or not, or whether there are any organizational or technical changes needed. It looks like they just aren't interested.

@pabuhler
Copy link
Member

@Lastique I wouldn´t say uninterested but maybe unresponsive. We did discuss this and the other PR at some point and had some concerns around having a "splattering" of SSE2 in the code base. I understand the current solution is not optimal for a larger set of SSRC´s so we need a solution here. I will try to follow up this PR in the coming weeks. Would be good though if there is someone else that can help review the SSE2 part of the PR. I do like the idea of abstracting the list in to a new type and as @murillo128 suggested placing it in a separate file so it can easily be tested and support different implementations if needed. Also keeps the srtp.c file free of non SRTP details.

@Lastique Lastique force-pushed the optimize_stream_list branch 2 times, most recently from 0cec15e to 014d371 Compare January 21, 2022 17:59
@Lastique
Copy link
Contributor Author

I've rebased onto the current master and:

  • Separated stream list implementation to a separate .c file.
  • Separated generic implementation and SSE2 addition into different commits.

I'm still publishing this as a single PR because there is not much incentive to merge the generic array-based implementation alone. The main performance winnings come from SSE2. Also, SSE2 code depends on the previous commit that implements the generic version. However, separate commits make it easier to review the generic implementation and SSE2 separately.

Copy link

@murillo128 murillo128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good initial work, but IMHO the split between the srtp.c and the stream_list is not clear.

We need to add an implementation of the stream list that is non-SSE based (i.e. the current linked list one).

The main missing piece are iterator functions to not have to rely on pos and size where iterating through the lists

srtp/stream_list.c Outdated Show resolved Hide resolved
srtp/srtp.c Outdated Show resolved Hide resolved
srtp/srtp.c Show resolved Hide resolved
srtp/srtp.c Outdated Show resolved Hide resolved
srtp/srtp.c Outdated Show resolved Hide resolved
srtp/srtp.c Outdated Show resolved Hide resolved
@Lastique
Copy link
Contributor Author

We need to add an implementation of the stream list that is non-SSE based (i.e. the current linked list one).

The old linked list implementation is incompatible with the interface required by the SSE2 implementation and therefore is removed. The only generic (non-SSE) implementation that is proposed in this PR is the array-based one.

@Lastique Lastique force-pushed the optimize_stream_list branch from 014d371 to 810e795 Compare January 21, 2022 20:13
@nils-ohlmeier
Copy link
Contributor

Consider also that libraries such as libsrtp are usually built in one variant for system packaging, and that library may be used by any kind of application, server or client. So I would prefer the code base to remain unified and optimal.

Sure. I'm not arguing against the system packagers choose what ever they think is best for their systems. None of the browsers uses system level libraries though. They all compile their own versions into the browsers. That is why I was asking for a compile time option, and I would have no problem if the new default is to compile with the SSE2 on by default, as long as vendors who can't ship with a dependency on SSE2 support have the ability to opt out of it.
Browsers also compile their binaries for the lowest common nominator. I would have to check, but I would not be surprised if that still rules out SSE2 optimizations for them. The only things which works for them is at run time detection of CPU features like SSE2.

@Lastique
Copy link
Contributor Author

Browsers also compile their binaries for the lowest common nominator.

Chrome and Firefox require SSE2 for many years now (since 2014 and 2016, respectively). As does Windows 8 and later. Really, x86 processors without SSE2 belong to a museum now.

@Lastique
Copy link
Contributor Author

I should also note that SSE2 is not mandatory with the proposed patch. If you disable SSE2 through the compiler switches (e.g. -mno-sse2 -mno-sse in gcc and clang) or that mode is the default on your platform, the code will compile and will use the scalar implementation.

@murillo128
Copy link

As discussed in this thread, we have extracted the current stream list implementation and create an internal api + default implementation in #612

@Lastique Lastique force-pushed the optimize_stream_list branch 3 times, most recently from 1235205 to b087f46 Compare February 28, 2023 18:15
@Lastique
Copy link
Contributor Author

Rebased on top of libsrtp 2.5.0.

The implementation replaces the default linked list based implementation. As I noted before, the generic interface is not suitable for efficient implementation of this array-based list. In particular, position and list size need to be exposed to the calling code to avoid double stream lookup in srtp_remove_stream.

This patch does not attempt to fix the generic list interface, as I find this rather pointless. Instead, it optimizes the case when the internal (now - array-based) list implementation is used, which will be the majority of use cases, I think.

If the generic list interface is to be dropped (which, IMHO, is the right thing to do), I can send another patch that removes prev/next pointers from srtp_stream_ctx_t_, as well as reorders its members to remove some of the internal padding, thus reducing the size of the structure.

@Lastique
Copy link
Contributor Author

BTW, the patch now includes #643 fix.

@Lastique Lastique force-pushed the optimize_stream_list branch from b087f46 to ef519c1 Compare March 5, 2023 17:28
The stream list in the SRTP context is now implemented with two arrays:
an array of SSRCs and an array of pointers to the streams corresponding
to the SSRCs. The streams no longer form a linked list.

Stream lookup by SSRC is now performed over the array of SSRCs, which
is considerably faster because it is more cache-friendly. Additionally,
the lookup is optimized for SSE2, which provides an additional massive
speedup with many streams in the list. Although the lookup still has
linear complexity, its absolute times are reduced and with tens to
hundreds elements are lower or comparable with a typical rb-tree
equivalent.

Expected speedup of SSE2 version over the previous implementation:

SSRCs    speedup (scalar)   speedup (SSE2)

1        0.39x              0.22x
3        0.57x              0.23x
5        0.69x              0.62x
10       0.77x              1.43x
20       0.86x              2.38x
30       0.87x              3.44x
50       1.13x              6.21x
100      1.25x              8.51x
200      1.30x              9.83x

These numbers were obtained on a Core i7 2600K.

At small numbers of SSRCs the new algorithm is somewhat slower, but
given that the absolute and relative times of the lookup are very small,
that slowdown is not very significant.
@Lastique Lastique force-pushed the optimize_stream_list branch from ef519c1 to b6186f4 Compare March 14, 2023 19:58
@jmillan
Copy link
Contributor

jmillan commented Jan 16, 2024

main branch does now store streams in an array rather than in a linked list. Hence, I think this PR is as least partially outdated.

@paulej
Copy link
Contributor

paulej commented Jan 16, 2024

I wonder how much faster your code can be using intrinsics functions?

@Lastique
Copy link
Contributor Author

We're currently using (patched) libsrtp 2.5.0, and I haven't updated the PR to main yet. Not sure when I'll have time to update the PR, most likely when a new libsrtp with the new array-based stream list is released. I'll see if I can do it earlier.

From the brief look at #674, it looks quite similar to the scalar version of the stream list proposed here. The main difference is that this PR stores SSRCs in a separate array, which allows for better cache locality during lookup and, of course, is necessary for vectorized lookup. Another, admittedly minor difference is that this PR uses nomenclature that is standard in C++, i.e. size and capacity (where size means the number of elements in the container and capacity is the size of the allocated storage) instead of size and available (where size is the size of the allocated storage and available is the amount of unused storage). I'd say, the standard way is more optimal since you want the number of elements more often than the number of unused elements. This is especially so with the vectorized implementation in this PR, as the lookup result is an index, which is then compared with the number of elements to test if the stream is found.

I'll update the performance test when I update the PR.

@jmillan
Copy link
Contributor

jmillan commented Jan 17, 2024

I'd say, the standard way is more optimal since you want the number of elements more often than the number of unused elements

That makes total sense. I've made the corresponding PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants